#pip install autogluon
This material is from Professor Guebin Choi's Special Topics in Big Data Analysis course (Fall 2023) at Jeonbuk National University.
02wk-005: Titanic, AutoGluon
최규빈
2023-09-12
1. Lecture Video
https://youtu.be/playlist?list=PLQqh36zP38-zZrOGpLc8spPa9L39RiNhR&si=TFl5m9-VohYT_47L
2. Import
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
for filename in filenames:
print(os.path.join(dirname, filename))
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
/kaggle/input/titanic/train.csv
/kaggle/input/titanic/test.csv
/kaggle/input/titanic/gender_submission.csv
from autogluon.tabular import TabularDataset, TabularPredictor
3. Steps of the Analysis
A. Data
- Analogy: this step is like being handed the exam problems.
tr = TabularDataset("~/Desktop/titanic/train.csv")
tst = TabularDataset("~/Desktop/titanic/test.csv")
Loaded data from: ~/Desktop/titanic/train.csv | Columns = 12 / 12 | Rows = 891 -> 891
Loaded data from: ~/Desktop/titanic/test.csv | Columns = 11 / 11 | Rows = 418 -> 418
type(tr)
autogluon.core.dataset.TabularDataset
B. Creating the Predictor
- Analogy: this step is like creating the student who will solve the problems.
TabularDataset??
Init signature: TabularDataset(data, **kwargs) Source: class TabularDataset(pd.DataFrame): """ A dataset in tabular format (with rows = samples, columns = features/variables). This object is essentially a pandas DataFrame (with some extra attributes) and all existing pandas methods can be applied to it. For full list of methods/attributes, see pandas Dataframe documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html Parameters ---------- data : :class:`pd.DataFrame` or str If str, path to data file (CSV or Parquet format). If you already have your data in a :class:`pd.DataFrame`, you can specify it here. Attributes ---------- file_path: (str) Path to data file from which this `TabularDataset` was created. None if `data` was a :class:`pd.DataFrame`. Note: In addition to these attributes, `TabularDataset` also shares all the same attributes and methods of a pandas Dataframe. For a detailed list, see: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html Examples -------- >>> from autogluon.core.dataset import TabularDataset >>> train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv') >>> train_data.head(30) >>> train_data.columns """ _metadata = ["file_path"] # preserved properties that will be copied to a new instance of TabularDataset @property def _constructor(self): return TabularDataset @property def _constructor_sliced(self): return pd.Series def __init__(self, data, **kwargs): if isinstance(data, str): file_path = data data = load_pd.load(file_path) else: file_path = None super().__init__(data, **kwargs) self.file_path = file_path File: ~/anaconda3/envs/py38/lib/python3.8/site-packages/autogluon/core/dataset.py Type: type Subclasses:
- Class inheritance! TabularDataset subclasses pandas DataFrame, so all DataFrame functionality is available.
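- Because of this inheritance, any pandas method can be applied to tr directly. A minimal sketch (ordinary pandas calls, not part of the original lecture code):
tr.head()            # peek at the first few rows
tr.isnull().sum()    # missing values per column (Age and Cabin have many)
tr.Survived.mean()   # overall survival rate in the training data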
TabularPredictor?
Init signature: TabularPredictor( label, problem_type=None, eval_metric=None, path=None, verbosity=2, log_to_file=False, log_file_path='auto', sample_weight=None, weight_evaluation=False, groups=None, **kwargs, ) Docstring: AutoGluon TabularPredictor predicts values in a column of a tabular dataset (classification or regression). Parameters ---------- label : str Name of the column that contains the target variable to predict. problem_type : str, default = None Type of prediction problem, i.e. is this a binary/multiclass classification or regression problem (options: 'binary', 'multiclass', 'regression', 'quantile'). If `problem_type = None`, the prediction problem type is inferred based on the label-values in provided dataset. eval_metric : function or str, default = None Metric by which predictions will be ultimately evaluated on test data. AutoGluon tunes factors such as hyperparameters, early-stopping, ensemble-weights, etc. in order to improve this metric on validation data. If `eval_metric = None`, it is automatically chosen based on `problem_type`. Defaults to 'accuracy' for binary and multiclass classification, 'root_mean_squared_error' for regression, and 'pinball_loss' for quantile. Otherwise, options for classification: ['accuracy', 'balanced_accuracy', 'f1', 'f1_macro', 'f1_micro', 'f1_weighted', 'roc_auc', 'roc_auc_ovo_macro', 'average_precision', 'precision', 'precision_macro', 'precision_micro', 'precision_weighted', 'recall', 'recall_macro', 'recall_micro', 'recall_weighted', 'log_loss', 'pac_score'] Options for regression: ['root_mean_squared_error', 'mean_squared_error', 'mean_absolute_error', 'median_absolute_error', 'mean_absolute_percentage_error', 'r2'] For more information on these options, see `sklearn.metrics`: https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics For metric source code, see `autogluon.core.metrics`. You can also pass your own evaluation function here as long as it follows formatting of the functions defined in folder `autogluon.core.metrics`. For detailed instructions on creating and using a custom metric, refer to https://auto.gluon.ai/stable/tutorials/tabular_prediction/tabular-custom-metric.html path : Union[str, pathlib.Path], default = None Path to directory where models and intermediate outputs should be saved. If unspecified, a time-stamped folder called "AutogluonModels/ag-[TIMESTAMP]" will be created in the working directory to store all models. Note: To call `fit()` twice and save all results of each fit, you must specify different `path` locations or don't specify `path` at all. Otherwise files from first `fit()` will be overwritten by second `fit()`. verbosity : int, default = 2 Verbosity levels range from 0 to 4 and control how much information is printed. Higher levels correspond to more detailed print statements (you can set verbosity = 0 to suppress warnings). If using logging, you can alternatively control amount of information printed via `logger.setLevel(L)`, where `L` ranges from 0 to 50 (Note: higher values of `L` correspond to fewer print statements, opposite of verbosity levels). Verbosity levels: 0: Only log exceptions 1: Only log warnings + exceptions 2: Standard logging 3: Verbose logging (ex: log validation score every 50 iterations) 4: Maximally verbose logging (ex: log validation score every iteration) log_to_file: bool, default = True Whether to save the logs into a file for later reference log_file_path: str, default = "auto" File path to save the logs. 
If auto, logs will be saved under `predictor_path/logs/predictor_log.txt`. Will be ignored if `log_to_file` is set to False sample_weight : str, default = None If specified, this column-name indicates which column of the data should be treated as sample weights. This column will NOT be considered as a predictive feature. Sample weights should be non-negative (and cannot be nan), with larger values indicating which rows are more important than others. If you want your usage of sample weights to match results obtained outside of this Predictor, then ensure sample weights for your training (or tuning) data sum to the number of rows in the training (or tuning) data. You may also specify two special strings: 'auto_weight' (automatically choose a weighting strategy based on the data) or 'balance_weight' (equally weight classes in classification, no effect in regression). If specifying your own sample_weight column, make sure its name does not match these special strings. weight_evaluation : bool, default = False Only considered when `sample_weight` column is not None. Determines whether sample weights should be taken into account when computing evaluation metrics on validation/test data. If True, then weighted metrics will be reported based on the sample weights provided in the specified `sample_weight` (in which case `sample_weight` column must also be present in test data). In this case, the 'best' model used by default for prediction will also be decided based on a weighted version of evaluation metric. Note: we do not recommend specifying `weight_evaluation` when `sample_weight` is 'auto_weight' or 'balance_weight', instead specify appropriate `eval_metric`. groups : str, default = None [Experimental] If specified, AutoGluon will use the column named the value of groups in `train_data` during `.fit` as the data splitting indices for the purposes of bagging. This column will not be used as a feature during model training. This parameter is ignored if bagging is not enabled. To instead specify a custom validation set with bagging disabled, specify `tuning_data` in `.fit`. The data will be split via `sklearn.model_selection.LeaveOneGroupOut`. Use this option to control the exact split indices AutoGluon uses. It is not recommended to use this option unless it is required for very specific situations. Bugs may arise from edge cases if the provided groups are not valid to properly train models, such as if not all classes are present during training in multiclass classification. It is up to the user to sanitize their groups. As an example, if you want your data folds to preserve adjacent rows in the table without shuffling, then for 3 fold bagging with 6 rows of data, the groups column values should be [0, 0, 1, 1, 2, 2]. **kwargs : learner_type : AbstractLearner, default = DefaultLearner A class which inherits from `AbstractLearner`. This dictates the inner logic of predictor. If you don't know what this is, keep it as the default. learner_kwargs : dict, default = None Kwargs to send to the learner. Options include: positive_class : str or int, default = None Used to determine the positive class in binary classification. This is used for certain metrics such as 'f1' which produce different scores depending on which class is considered the positive class. If not set, will be inferred as the second element of the existing unique classes after sorting them. If classes are [0, 1], then 1 will be selected as the positive class. 
If classes are ['def', 'abc'], then 'def' will be selected as the positive class. If classes are [True, False], then True will be selected as the positive class. ignored_columns : list, default = None Banned subset of column names that predictor may not use as predictive features (e.g. unique identifier to a row or user-ID). These columns are ignored during `fit()`. label_count_threshold : int, default = 10 For multi-class classification problems, this is the minimum number of times a label must appear in dataset in order to be considered an output class. AutoGluon will ignore any classes whose labels do not appear at least this many times in the dataset (i.e. will never predict them). cache_data : bool, default = True When enabled, the training and validation data are saved to disk for future reuse. Enables advanced functionality in predictor such as `fit_extra()` and feature importance calculation on the original data. trainer_type : AbstractTrainer, default = AutoTrainer A class inheriting from `AbstractTrainer` that controls training/ensembling of many models. If you don't know what this is, keep it as the default. Attributes ---------- path : str Path to directory where all models used by this Predictor are stored. problem_type : str What type of prediction problem this Predictor has been trained for. eval_metric : function or str What metric is used to evaluate predictive performance. label : str Name of table column that contains data from the variable to predict (often referred to as: labels, response variable, target variable, dependent variable, Y, etc). feature_metadata : :class:`autogluon.common.features.feature_metadata.FeatureMetadata` Inferred data type of each predictive variable after preprocessing transformation (i.e. column of training data table used to predict `label`). Contains both raw dtype and special dtype information. Each feature has exactly 1 raw dtype (such as 'int', 'float', 'category') and zero to many special dtypes (such as 'datetime_as_int', 'text', 'text_ngram'). Special dtypes are AutoGluon specific feature types that are used to identify features with meaning beyond what the raw dtype can convey. `feature_metadata.type_map_raw`: Dictionary of feature name -> raw dtype mappings. `feature_metadata.type_group_map_special`: Dictionary of lists of special feature names, grouped by special feature dtype. positive_class : str or int Returns the positive class name in binary classification. Useful for computing metrics such as F1 which require a positive and negative class. In binary classification, :meth:`TabularPredictor.predict_proba` returns the estimated probability that each row belongs to the positive class. Will print a warning and return None if called when `predictor.problem_type != 'binary'`. class_labels : list For multiclass problems, this list contains the class labels in sorted order of `predict_proba()` output. For binary problems, this list contains the class labels in sorted order of `predict_proba(as_multiclass=True)` output. `class_labels[0]` corresponds to internal label = 0 (negative class), `class_labels[1]` corresponds to internal label = 1 (positive class). This is relevant for certain metrics such as F1 where True and False labels impact the metric score differently. For other problem types, will equal None. For example if `pred = predict_proba(x, as_multiclass=True)`, then ith index of `pred` provides predicted probability that `x` belongs to class given by `class_labels[i]`. 
class_labels_internal : list For multiclass problems, this list contains the internal class labels in sorted order of internal `predict_proba()` output. For binary problems, this list contains the internal class labels in sorted order of internal `predict_proba(as_multiclass=True)` output. The value will always be `class_labels_internal=[0, 1]` for binary problems, with 0 as the negative class, and 1 as the positive class. For other problem types, will equal None. class_labels_internal_map : dict For binary and multiclass classification problems, this dictionary contains the mapping of the original labels to the internal labels. For example, in binary classification, label values of 'True' and 'False' will be mapped to the internal representation `1` and `0`. Therefore, class_labels_internal_map would equal {'True': 1, 'False': 0} For other problem types, will equal None. For multiclass, it is possible for not all of the label values to have a mapping. This indicates that the internal models will never predict those missing labels, and training rows associated with the missing labels were dropped. File: ~/anaconda3/envs/py38/lib/python3.8/site-packages/autogluon/tabular/predictor/predictor.py Type: type Subclasses: _TabularPredictorExperimental, InterpretableTabularPredictor
- “label”: the target variable to predict (here, Survived).
predictr = TabularPredictor("Survived")
No path specified. Models will be saved in: "AutogluonModels/ag-20230917_135346/"
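- Only label was passed above, so AutoGluon infers the rest. The signature shown by TabularPredictor? also accepts problem_type, eval_metric, and path explicitly; a hedged sketch (the name predictr_explicit and the folder name are just illustrations, and the rest of the notebook keeps using predictr):
predictr_explicit = TabularPredictor(
    label="Survived",
    problem_type="binary",               # would otherwise be inferred from the label values
    eval_metric="accuracy",              # the default for binary classification anyway
    path="AutogluonModels/titanic_demo"  # example folder; omit to get a time-stamped one
)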
C. Fitting (fit)
- Analogy: this step is like the student studying.
- Training
predictr.fit(tr) # give the student (predictr) the problems (tr) and have it study (predictr.fit())
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20230917_135346/"
AutoGluon Version: 0.8.2
Python Version: 3.8.18
Operating System: Linux
Platform Machine: x86_64
Platform Version: #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Jul 13 16:27:29 UTC 2
Disk Space Avail: 775.59 GB / 982.82 GB (78.9%)
Train Data Rows: 891
Train Data Columns: 11
Label Column: Survived
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
2 unique label values: [0, 1]
If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping: class 1 = 1, class 0 = 0
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
Available Memory: 39350.68 MB
Train Data (Original) Memory Usage: 0.31 MB (0.0% of available memory)
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
Fitting AsTypeFeatureGenerator...
Note: Converting 1 features to boolean dtype as they only contain 2 unique values.
Stage 2 Generators:
Fitting FillNaFeatureGenerator...
Stage 3 Generators:
Fitting IdentityFeatureGenerator...
Fitting CategoryFeatureGenerator...
Fitting CategoryMemoryMinimizeFeatureGenerator...
Fitting TextSpecialFeatureGenerator...
Fitting BinnedFeatureGenerator...
Fitting DropDuplicatesFeatureGenerator...
Fitting TextNgramFeatureGenerator...
Fitting CountVectorizer for text features: ['Name']
CountVectorizer fit with vocabulary size = 8
Stage 4 Generators:
Fitting DropUniqueFeatureGenerator...
Stage 5 Generators:
Fitting DropDuplicatesFeatureGenerator...
Types of features in original data (raw dtype, special dtypes):
('float', []) : 2 | ['Age', 'Fare']
('int', []) : 4 | ['PassengerId', 'Pclass', 'SibSp', 'Parch']
('object', []) : 4 | ['Sex', 'Ticket', 'Cabin', 'Embarked']
('object', ['text']) : 1 | ['Name']
Types of features in processed data (raw dtype, special dtypes):
('category', []) : 3 | ['Ticket', 'Cabin', 'Embarked']
('float', []) : 2 | ['Age', 'Fare']
('int', []) : 4 | ['PassengerId', 'Pclass', 'SibSp', 'Parch']
('int', ['binned', 'text_special']) : 9 | ['Name.char_count', 'Name.word_count', 'Name.capital_ratio', 'Name.lower_ratio', 'Name.special_ratio', ...]
('int', ['bool']) : 1 | ['Sex']
('int', ['text_ngram']) : 9 | ['__nlp__.henry', '__nlp__.john', '__nlp__.master', '__nlp__.miss', '__nlp__.mr', ...]
0.1s = Fit runtime
11 features in original data used to generate 28 features in processed data.
Train Data (Processed) Memory Usage: 0.07 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.16s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 712, Val Rows: 179
User-specified model hyperparameters to be fit:
{
'NN_TORCH': {},
'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, 'GBMLarge'],
'CAT': {},
'XGB': {},
'FASTAI': {},
'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],
}
Fitting 13 L1 models ...
Fitting model: KNeighborsUnif ...
Exception ignored on calling ctypes callback function: <function _ThreadpoolInfo._find_modules_with_dl_iterate_phdr.<locals>.match_module_callback at 0x7f291690c3a0>
Traceback (most recent call last):
File "/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/threadpoolctl.py", line 400, in match_module_callback
self._make_module_from_path(filepath)
File "/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/threadpoolctl.py", line 515, in _make_module_from_path
module = module_class(filepath, prefix, user_api, internal_api)
File "/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/threadpoolctl.py", line 606, in __init__
self.version = self.get_version()
File "/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/threadpoolctl.py", line 646, in get_version
config = get_config().split()
AttributeError: 'NoneType' object has no attribute 'split'
0.6536 = Validation score (accuracy)
0.03s = Training runtime
0.01s = Validation runtime
Fitting model: KNeighborsDist ...
Exception ignored on calling ctypes callback function: <function _ThreadpoolInfo._find_modules_with_dl_iterate_phdr.<locals>.match_module_callback at 0x7f291690c3a0>
Traceback (most recent call last):
File "/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/threadpoolctl.py", line 400, in match_module_callback
self._make_module_from_path(filepath)
File "/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/threadpoolctl.py", line 515, in _make_module_from_path
module = module_class(filepath, prefix, user_api, internal_api)
File "/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/threadpoolctl.py", line 606, in __init__
self.version = self.get_version()
File "/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/threadpoolctl.py", line 646, in get_version
config = get_config().split()
AttributeError: 'NoneType' object has no attribute 'split'
0.6536 = Validation score (accuracy)
0.02s = Training runtime
0.01s = Validation runtime
Fitting model: LightGBMXT ...
0.8156 = Validation score (accuracy)
0.2s = Training runtime
0.0s = Validation runtime
Fitting model: LightGBM ...
0.8212 = Validation score (accuracy)
0.21s = Training runtime
0.0s = Validation runtime
Fitting model: RandomForestGini ...
0.8156 = Validation score (accuracy)
0.29s = Training runtime
0.02s = Validation runtime
Fitting model: RandomForestEntr ...
0.8156 = Validation score (accuracy)
0.29s = Training runtime
0.02s = Validation runtime
Fitting model: CatBoost ...
0.8268 = Validation score (accuracy)
0.36s = Training runtime
0.0s = Validation runtime
Fitting model: ExtraTreesGini ...
0.8156 = Validation score (accuracy)
0.28s = Training runtime
0.02s = Validation runtime
Fitting model: ExtraTreesEntr ...
0.8101 = Validation score (accuracy)
0.29s = Training runtime
0.02s = Validation runtime
Fitting model: NeuralNetFastAI ...
No improvement since epoch 9: early stopping
0.8324 = Validation score (accuracy)
0.45s = Training runtime
0.01s = Validation runtime
Fitting model: XGBoost ...
0.8101 = Validation score (accuracy)
0.12s = Training runtime
0.0s = Validation runtime
Fitting model: NeuralNetTorch ...
0.8212 = Validation score (accuracy)
1.19s = Training runtime
0.01s = Validation runtime
Fitting model: LightGBMLarge ...
0.8324 = Validation score (accuracy)
0.41s = Training runtime
0.0s = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
0.8324 = Validation score (accuracy)
0.33s = Training runtime
0.0s = Validation runtime
AutoGluon training complete, total runtime = 4.87s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20230917_135346/")
<autogluon.tabular.predictor.predictor.TabularPredictor at 0x7f2899b57f70>
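- fit() was called with its defaults here. It also accepts options such as time_limit and presets (see the AutoGluon documentation); a hedged sketch of a longer run, not what was executed above (a predictor can only be fit once, so this would need a fresh object, named predictr2 here just for illustration):
predictr2 = TabularPredictor(label="Survived")
predictr2.fit(tr, time_limit=60, presets="best_quality")  # illustrative: 60-second budget, heavier bagging/ensembling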
- Check the leaderboard (like grading a mock exam)
predictr.leaderboard()
model score_val pred_time_val fit_time pred_time_val_marginal fit_time_marginal stack_level can_infer fit_order
0 LightGBMLarge 0.832402 0.002893 0.406132 0.002893 0.406132 1 True 13
1 NeuralNetFastAI 0.832402 0.006791 0.452260 0.006791 0.452260 1 True 10
2 WeightedEnsemble_L2 0.832402 0.007321 0.783443 0.000530 0.331183 2 True 14
3 CatBoost 0.826816 0.003689 0.362406 0.003689 0.362406 1 True 7
4 LightGBM 0.821229 0.003210 0.207445 0.003210 0.207445 1 True 4
5 NeuralNetTorch 0.821229 0.007692 1.186131 0.007692 1.186131 1 True 12
6 LightGBMXT 0.815642 0.003147 0.195812 0.003147 0.195812 1 True 3
7 RandomForestEntr 0.815642 0.024688 0.289305 0.024688 0.289305 1 True 6
8 RandomForestGini 0.815642 0.024689 0.285907 0.024689 0.285907 1 True 5
9 ExtraTreesGini 0.815642 0.024699 0.280003 0.024699 0.280003 1 True 8
10 XGBoost 0.810056 0.004088 0.119255 0.004088 0.119255 1 True 11
11 ExtraTreesEntr 0.810056 0.024130 0.290494 0.024130 0.290494 1 True 9
12 KNeighborsUnif 0.653631 0.006800 0.025122 0.006800 0.025122 1 True 1
13 KNeighborsDist 0.653631 0.009379 0.023463 0.009379 0.023463 1 True 2
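- The scores above come from the internal 80/20 validation split. A labeled dataset can also be passed in to score every model on it, and the best model can be queried directly; a minimal sketch on the training data (standard TabularPredictor methods in the 0.8.x API used here):
predictr.leaderboard(tr, silent=True)   # adds a column with each model's score on tr
predictr.evaluate(tr)                   # metric(s) of the best model on tr
predictr.get_model_best()               # name of the model predict() uses by default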
D. Prediction (predict)
- Analogy: this is like solving the problems after studying.
- Predict on the training set \(\to\) check the score
(tr.Survived == predictr.predict(tr)).mean()
0.8810325476992144
(tr.Survived == (tr.Sex == "female")).mean() # compare with the previous score (the Sex == "female" baseline)
0.7867564534231201
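- predict() returns hard 0/1 labels. If class probabilities are needed (for example, to try a different decision threshold), predict_proba is also available; a small sketch:
predictr.predict_proba(tr).head()   # estimated probability of each class for the first rows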
- Predict on the test set \(\to\) submit the results to Kaggle to get the score
tst.assign(Survived = predictr.predict(tst)).loc[:,['PassengerId','Survived']]\
.to_csv("autogluon_submission.csv",index=False)
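- Before uploading, a quick sanity check of the file written above confirms the expected Kaggle format (418 rows, two columns); a minimal sketch:
sub = pd.read_csv("autogluon_submission.csv")
print(sub.shape)   # expected: (418, 2)
sub.head()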